About the project

I am very excited about this course. I think I will learn a lot of interesting and useful things. I study statistics, so I would like this course to be a bit more practical than my other studies.

Johanna Tuhkanen recommended this course to me, and I also received a letter from Kimmo Vehkalahti advertising it.

Exercise 2

Data reading and exploring

students14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
head(students14)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
str(students14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students14)
## [1] 166   7

The data comes from a survey on teaching and learning and contains the answers of 166 students. For each respondent, we have age and gender, four variables measuring the student’s deep, surface and strategic approaches to learning and his/her attitude towards statistics, and the variable “points”, the student’s exam score.

library(ggplot2)
library(GGally)

summary(students14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
ggpairs(students14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

ggpairs(students14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

Let us first describe the distributions of the variables. Then we will take a look at their connections to each other, especially to the “points” variable.

The “gender” variable is nominal: a factor with two levels, “F” for female and “M” for male. According to the plot, more women than men answered this survey. We also notice that on average our respondents are quite young.

From the raw data alone it is very hard to notice any correlations between the variables; fortunately, the plot helps us. The strongest correlation is between the “attitude” and “points” variables. Additionally, there is a moderate negative correlation (about -0.3) between the deep and surface approaches.

The strategic and surface approaches correlate with points as well. I think that those variables and “attitude” should be chosen as explanatory variables for a regression model where exam points are the response variable.

Choosing the right model

Let us start with this model: y_i = b_0 + b_1*x_(attitude,i) + b_2*x_(stra,i) + b_3*x_(surf,i) + e_i
We are going to fit it now.

model <- lm(points ~ attitude + stra + surf, data = students14)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

So we get the following results:

  • The regression coefficients are 3.4 for “attitude”, 0.9 for “stra” and -0.6 for “surf”.
  • However, the p-values for “stra” and “surf” are large (>0.1), so we cannot reject the hypothesis that their coefficients are zero: these two variables do not have a statistically significant relationship with the target variable.
  • The p-value for “attitude” is almost zero, which means this variable has a statistically significant relationship with the “points” variable.
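The p-values discussed above can also be pulled out of the summary object programmatically. A minimal sketch, using the built-in mtcars data as a stand-in (the course data requires a download):

```r
# Extract coefficient p-values from a fitted lm
# (mtcars stands in for the students14 data here)
m <- lm(mpg ~ wt + hp, data = mtcars)
pvals <- coef(summary(m))[, "Pr(>|t|)"]
pvals > 0.1  # TRUE would mark coefficients we cannot distinguish from zero
```

Applied to the model above, the same idea would flag “stra” and “surf” for removal.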

Let us remove the statistically non-significant variables and fit the model again.

model <- lm(points ~ attitude, data = students14)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = students14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In this model, we have only the “attitude” variable. The regression coefficient is 3.5: when a student’s attitude towards statistics grows by one point, the expected exam score grows by 3.5 points. The multiple R-squared of this model is 0.19, which means that the “attitude” variable explains about 19% of the variation in the “points” variable.

Model diagnostics

When we use this model, we make two assumptions: 1. the relationship between the explanatory variable and the response is linear, and 2. the errors e_i are normally distributed with a mean of 0 and constant variance.

The diagnostic plots below will help us assess these assumptions.

par(mfrow = c(2,2))
plot(model, which = c(1,2,5))

The Normal Q-Q plot shows little evidence of departure from the line, so there is no evidence that the errors are not normally distributed.

The “Residuals vs Fitted” plot does not show the variance of the residuals increasing or decreasing with the fitted values, so we can say that the variance of e_i is constant.

Leverage measures how much impact a single observation has on the model. The largest leverage among the observations is about 0.04, which is very small, so we can say that this data does not have any outliers that would strongly affect the model.
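The leverages behind the Residuals vs Leverage plot can be extracted directly with hatvalues(). A small sketch, again with mtcars as a stand-in for the fitted model:

```r
# Hat values measure each observation's pull on its own fitted value;
# they always sum to the number of model parameters
m <- lm(mpg ~ wt, data = mtcars)
h <- hatvalues(m)
max(h)        # the single largest leverage
which.max(h)  # the observation responsible for it
```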


Exercise 3

Data reading and exploring

alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = "," , header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data is about the students of two Portuguese schools. It contains basic information about the students, like gender and age, and also tells about their alcohol consumption.

About my study

The purpose of my analysis is to study the relationships between high/low alcohol consumption and the following variables:

  1. the student’s age
  2. the student’s school
  3. the quality of the student’s family relationships
  4. the student’s plans to take higher education

I think that age might have a significant effect on alcohol consumption: I hypothesize that as a student becomes older, he or she drinks more alcohol. I am also interested to know whether there is any difference in alcohol consumption between these two schools; it might be that one of them is located in a worse area, which increases alcohol consumption. My hypothesis about the effect of family relationships is that students who have a good relationship with their family are less prone to drinking than students whose relationships are not that good. My fourth explanatory variable is the student’s plans to take higher education: I think that a student who is planning to study further focuses more on his or her studies and drinks less than others.

The explanatory variables and their relationships with alcohol consumption

Let us start with the variable “age”.

library(ggplot2)
ggplot(data = alc, aes(x = age, fill = high_use)) + geom_bar()

What we see is that there is a peak at the age of 17.

The second variable is school.

ggplot(data = alc, aes(x = school, fill = high_use)) + geom_bar()

It’s quite hard to say whether alcohol consumption is higher in one of the schools. In our further analysis, it might turn out that this variable and our response variable are independent.

The next graph is about alcohol consumption and family relationships.

ggplot(data = alc, aes(x = famrel, fill = high_use)) + geom_bar()

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
t <- table(family = alc$famrel, high_use = alc$high_use)
t
##       high_use
## family FALSE TRUE
##      1     7    2
##      2     9    9
##      3    40   26
##      4   131   52
##      5    83   23
for (i in 1:nrow(t)) {
  sum <- t[i,1] + t[i,2]
  t[i, 1] <- t[i,1]/sum
  t[i, 2] <- t[i,2]/sum
}
t
##       high_use
## family     FALSE      TRUE
##      1 0.7777778 0.2222222
##      2 0.5000000 0.5000000
##      3 0.6060606 0.3939394
##      4 0.7158470 0.2841530
##      5 0.7830189 0.2169811

Based on this last table, the share of high users tends to decrease as family relationships improve (though not perfectly monotonically), so there is a negative association between this variable and the response variable.
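As an aside, the row-normalising loop above can be replaced by base R’s prop.table with margin = 1. A self-contained sketch on the same counts:

```r
# Rebuild the famrel x high_use counts shown above, then divide each row by its sum
t <- as.table(matrix(c(7, 2, 9, 9, 40, 26, 131, 52, 83, 23),
                     nrow = 5, byrow = TRUE,
                     dimnames = list(family = as.character(1:5),
                                     high_use = c("FALSE", "TRUE"))))
prop.table(t, margin = 1)  # row proportions, e.g. 7/9 = 0.778 for family = 1
```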

Also, let us see whether students who want to continue their studies are less often heavy drinkers.

ggplot(data = alc, aes(x = higher, fill = high_use)) + geom_bar()

t <- table(higher = alc$higher, high_use = alc$high_use)
t
##       high_use
## higher FALSE TRUE
##    no      9    9
##    yes   261  103
for (i in 1:nrow(t)) {
  sum <- t[i,1] + t[i,2]
  t[i, 1] <- t[i,1]/sum
  t[i, 2] <- t[i,2]/sum
}
t
##       high_use
## higher    FALSE     TRUE
##    no  0.500000 0.500000
##    yes 0.717033 0.282967

It seems that my hypothesis holds quite well: half of the students not planning on higher education are high users, against about 28% of those who are.

The logistic regression

Now it is time to use logistic regression to statistically explore the relationship between these variables and the binary high/low alcohol consumption target variable.

model <- glm(high_use ~ age + school + famrel + higher, data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ age + school + famrel + higher, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3740  -0.8320  -0.7316   1.3546   1.8605  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.01814    1.96224  -1.028   0.3037  
## age          0.17710    0.10748   1.648   0.0994 .
## schoolMS    -0.01764    0.38515  -0.046   0.9635  
## famrel      -0.29855    0.12189  -2.449   0.0143 *
## higheryes   -0.68147    0.50601  -1.347   0.1781  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 449.89  on 377  degrees of freedom
## AIC: 459.89
## 
## Number of Fisher Scoring iterations: 4

Based on the results of fitting the model, only the variable “famrel” is statistically significant at the 5% level (“age” is only borderline significant at the 10% level).

The odds ratios of the coefficients and their 95% confidence intervals are below:

exp(cbind(coef(model), confint(model)))
## Waiting for profiling to be done...
##                             2.5 %    97.5 %
## (Intercept) 0.1329027 0.002799011 6.2614310
## age         1.1937562 0.967246744 1.4759900
## schoolMS    0.9825138 0.452417639 2.0657936
## famrel      0.7418952 0.583097300 0.9419763
## higheryes   0.5058743 0.185373265 1.3859659

Here we have one more piece of evidence that only “famrel” is statistically significant: its confidence interval does not contain 1, so it does have an effect on alcohol consumption. The other variables’ confidence intervals contain 1, so there is no evidence that they really affect our response variable.

The predictive power of the model

Now we are going to create a model that contains only “famrel” as an explanatory variable.

model <- glm(high_use ~ famrel, data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ famrel, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1631  -0.8220  -0.7244   1.4489   1.7125  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2587     0.4744   0.545   0.5855  
## famrel       -0.2925     0.1197  -2.445   0.0145 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 456.24  on 380  degrees of freedom
## AIC: 460.24
## 
## Number of Fisher Scoring iterations: 4

So, let us explore the predictive power of this model.

probabilities <- predict(model, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2931937
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use     FALSE       Sum
##    FALSE 0.7068063 0.7068063
##    TRUE  0.2931937 0.2931937
##    Sum   1.0000000 1.0000000
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
## [1] 0.3010471

The average proportion of wrong predictions in the training data is 0.29, which is bigger than the one in the DataCamp exercise. Note from the table that the model never predicts high use: with “famrel” as the only predictor, every fitted probability stays below 0.5, so all students are classified as low users. The 10-fold cross-validation error (0.30) is slightly larger than the training error.


Exercise 4

Data downloading and exploring

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset is small, with only 506 cases. There are 14 attributes in each case:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town (a proxy for industry externalities such as noise and heavy traffic)
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940 (related to structure quality)
  • DIS - weighted distances to five Boston employment centres
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000’s

All variables are numerical, but not every variable is continuous (“chas”, for example, is binary).

Overview of the data

library(tidyverse)
library(corrplot)
corr_matrix <- cor(Boston)
corrplot(corr_matrix, method = 'square', type = 'upper', cl.pos = 'b', tl.cex = 0.6)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Because the pairs plot would be very complex this time, I decided to explore a corrplot first. To make this plot, I calculated the correlation matrix and then plotted it with the function “corrplot” from the package “corrplot”.

What I notice from this plot:

  • Next pairs have strong negative correlations:
    • lstat and medv
    • age and dis
    • age and nox
    • age and indus
  • And these pairs have a strong positive correlation:
    • rad and tax (very strong!)
    • indus and tax
    • indus and nox

For example, we can take a look at the relationship of rad and tax.

ggplot(data = Boston, aes(x = rad, y = tax)) +
  geom_point()

I think we just found the reason for such a strong correlation: the group of observations in the upper right corner. We can test what happens if we remove it.

library(dplyr)
cor_test <- as.data.frame(cbind(Boston$rad, Boston$tax))
cor_test <- filter(cor_test, V1 < 20)
cor(cor_test)
##           V1        V2
## V1 1.0000000 0.1882562
## V2 0.1882562 1.0000000

After removing these observations, the correlation drops to about 0.2, so my hypothesis about why the correlation was so strong was right.

Standardising the dataset

To go on, we need to standardise, or scale, our dataset. This is easily done with the function “scale”.

b_scaled <- scale(Boston)
summary(b_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
b_scaled <- as.data.frame(b_scaled)

Scaled variables have a mean of 0 and a standard deviation of 1: scaling subtracts the column mean and divides by the column standard deviation. Also, I changed our dataset from a matrix to a data frame; I think it will be useful later.
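The mean-0, sd-1 property is easy to verify on any numeric column; a quick sketch with mtcars as a stand-in:

```r
# scale() subtracts the column mean and divides by the column standard deviation
x <- scale(mtcars$mpg)
round(mean(x), 10)   # 0
sd(as.numeric(x))    # 1
```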

New variable CRIME

Now we create a categorical variable from the crime rate, using the quantiles as the break points.

library(dplyr)
breaks <- quantile(b_scaled$crim)
crime <- cut(b_scaled$crim, breaks = breaks, include.lowest = TRUE, labels = c('low', 'med_low', 'med_high', 'high'))
b_scaled <- b_scaled[,2:14]
b_scaled <- data.frame(b_scaled, crime)

With the last lines of code, I deleted the original column “crim” and added our new categorical column “crime”.

Dividing the dataset to train and test sets

For our future analysis, we need to divide our data into train and test sets. We are going to put 80 percent of the data into the train set.

ind <- sample(nrow(b_scaled), size = nrow(b_scaled)*0.8)
train <- b_scaled[ind,]
test <- b_scaled[-ind,]
correct_classes <- test$crime
test <- test[,1:13]

Fitting the LDA

We are going to fit the linear discriminant analysis on the train set. We are using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

b_lda <- lda(crime ~ ., data = train)

The LDA plot is below.

b_arrows <- function(x, myscale = 1, arrow.heads = 0.1, color = 'pink', tex = 0.75, choices = c(1,2)) {
  heads = coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale*heads[,choices[1]],
         y1 = myscale*heads[,choices[2]],
         col = color, length = arrow.heads)
  text(myscale*heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes = as.numeric(train$crime)
plot(b_lda, dimen = 2, col = classes)
b_arrows(b_lda, myscale = 2)

Predicting the classes

Then we are going to predict the classes with the LDA model on the test data.

b_predict <- predict(b_lda, newdata = test)
t <- table(correct = correct_classes, predicted = b_predict$class)
t
##           predicted
## correct    low med_low med_high high
##   low       12      11        4    0
##   med_low    5      15        5    0
##   med_high   1       2       23    3
##   high       0       0        0   21

I think the model predicted the class “high” best, maybe because this class differs a lot from the other classes. To see in which class we have the most errors, we can create the same table with percentages.

library(tidyverse)
c1 <- t[,1]/colSums(t)[1]
c2 <- t[,2]/colSums(t)[2]
c3 <- t[,3]/colSums(t)[3]
c4 <- t[,4]/colSums(t)[4]
t2 <- cbind(c1, c2, c3, c4)
colnames(t2) <- c('low', 'med_low', 'med_high', 'high')
t2
##                 low    med_low med_high  high
## low      0.66666667 0.39285714  0.12500 0.000
## med_low  0.27777778 0.53571429  0.15625 0.000
## med_high 0.05555556 0.07142857  0.71875 0.125
## high     0.00000000 0.00000000  0.00000 0.875

The predictions of the class “med_low” contained the biggest share of wrong predictions, and the predictions of the class “high” were the most accurate.
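As a single summary of the confusion matrix above, we can also compute the overall accuracy, i.e. the share of predictions on the diagonal (note that the exact numbers vary between runs, because the train/test split is random):

```r
# Confusion matrix copied from the output above (rows = correct, cols = predicted)
t <- matrix(c(12, 11,  4,  0,
               5, 15,  5,  0,
               1,  2, 23,  3,
               0,  0,  0, 21),
            nrow = 4, byrow = TRUE)
sum(diag(t)) / sum(t)  # overall accuracy, about 0.70 here
```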

K-means clustering

Now we are going to do k-means clustering on the “Boston” data. First of all, we have to decide how many clusters we should have. We can do this by looking at how the total within-cluster sum of squares (WCSS) changes as the number of clusters grows.

data(Boston)
b_scaled <- scale(Boston)
dist <- dist(b_scaled)

set.seed(123)
k_max <- 12
twscc <- sapply(1:k_max, function(k){
  kmeans(b_scaled, k)$tot.withinss
})
qplot(x = 1:12, y = twscc, geom = 'line')

I think 2 is the optimal number of clusters, since the total WCSS drops most sharply when moving from one cluster to two.

Let’s do the k-means clustering with 2 clusters and interpret the results. To be honest it isn’t easy to interpret the results. Anyway, my thoughts:

  • indus is higher in class 2 (pink) -> housing is located in more industrial areas
  • nox is higher in class 2 -> more air pollution
  • age is higher in class 2 -> the housing stock in class 2 is older
  • dis is higher in class 2 -> housing is located farther from the employment centres
  • tax is higher in class 2 -> the property-tax rate is higher in class 2
  • medv is lower in class 2 -> the median value of homes is lower in this class

Based on these thoughts, class 2 includes less attractive housing (worse air, located next to industry). The houses of class 1 are located better and their air is cleaner, which is why their prices are usually higher.
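The final clustering step is not echoed above; a sketch of how the 2-cluster solution behind this interpretation could be reproduced (the commented pairs() line is one possible way to visualise it):

```r
library(MASS)  # Boston data
set.seed(123)  # k-means uses random starting centers
km <- kmeans(scale(Boston), centers = 2)
table(km$cluster)  # sizes of the two clusters
# pairs(scale(Boston)[, c("indus", "nox", "age", "dis", "tax", "medv")],
#       col = km$cluster)  # colour the points by cluster
```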


Exercise 5

Overview

Let us first take a look at the data: what variables it has and what kind of variables they are. We will also check whether there are any correlations between the variables.

human <- read.csv('human.csv',row.names = 1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ eduFM    : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labFM    : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ ExpEdu   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ ExpLife  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MMR      : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ BirthRate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ PP       : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The variables stand for:

  • eduFM = (Proportion of females with at least secondary education) / (Proportion of males with at least secondary education)
  • labFM = (Proportion of females in the labour force)/(Proportion of males in the labour force)
  • ExpEdu = Expected years of schooling
  • ExpLife = Life expectancy at birth
  • GNI = Gross National Income per capita
  • MMR = Maternal mortality ratio
  • BirthRate = Adolescent birth rate
  • PP = Percentage of female representatives in parliament
library(dplyr)
library(ggplot2)
library(GGally)

ggpairs(human)

cor(human) %>% corrplot::corrplot()

We notice that there is a strong negative correlation between MMR and ExpLife: the higher the maternal mortality ratio, the shorter the life expectancy. Also, ExpEdu (expected years of schooling) and ExpLife (life expectancy at birth) have a strong positive correlation.

PCA on the non standardized data

Now it is time to perform PCA. I am going to do it on the non standardized data.

pca_non_sta <- prcomp(human)
summary(pca_non_sta)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
biplot(pca_non_sta, choices = 1:2)

The results we got don’t make sense at all. The reason is that our data is not standardized: GNI has a far larger variance than the other variables, so the first principal component essentially just points in the direction of GNI and captures practically all of the variance.
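This GNI effect can be reproduced on any data where one column’s variance dwarfs the others; a toy sketch:

```r
# One high-variance column hijacks the first principal component
set.seed(1)
d <- data.frame(small = rnorm(100),             # sd about 1
                big   = rnorm(100, sd = 1000))  # sd about 1000
summary(prcomp(d))$importance["Proportion of Variance", ]
```

Standardizing the columns first (or calling prcomp with scale. = TRUE) removes the effect.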

PCA on the standardized data

Let us do PCA on the standardized data this time.

human <- scale(human) # standardizing
pca_sta <- prcomp(human)
summary(pca_sta)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_sta, choices = 1:2, cex = c(0.5, 1), col = c("pink", "black"))

The analysis on the standardized data produces more realistic and more useful results than the previous one did.

Interpretation

A high maternal mortality ratio and a high adolescent birth rate are signs of a developing country. On the other hand, high education, long life expectancy and high GNI are often related to developed countries. So the first principal component measures how developed a country is.

The second component has a strong positive correlation with PP and labFM, where PP stands for the percentage of female representatives in parliament and labFM for the female proportion of the labour force compared to the male one. I would say that this component shows how equal women’s position is compared to men’s.
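These component-variable relationships can be read directly from the rotation (loadings) matrix of the prcomp object. A sketch on a built-in dataset (USArrests as a stand-in for the human data):

```r
# Each column of the rotation matrix gives the variables' loadings on one PC
pca <- prcomp(USArrests, scale. = TRUE)
round(pca$rotation[, 1:2], 2)
```

For the human data, pca_sta$rotation[, 1:2] would show the loadings behind the interpretation above.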

Tea data

Now we are going to do MCA on the tea dataset.

# download "tea"
library(FactoMineR)
library(tidyr)
data(tea)
tea <- dplyr::select(tea, one_of(c("tea.time", "tearoom", "Tea","age_Q", "sex")))
tea <- filter(tea, complete.cases(tea))
gather(tea) %>% ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size= 9))

mca <- MCA(tea, graph = FALSE)
plot(mca, invisible = c("ind"))

What I notice is that women are more likely to have tea time and visit tearooms than men. Also, black tea is the most popular among older people (60+). Young people between 25 and 34 don’t usually have any tea time.


Exercise 6

1. Analysis of RATS data

In this part, we are going to analyze the longitudinal data with the help of different graphs. We will not use any linear mixed-effects models.

First of all, let us download the data and explore it.

RATS <- read.csv("data/rats.csv")
RATS_o <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

Our data is from a nutrition study conducted on three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven, when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.

Individuals on the plot

To begin, we will plot the weight values for all rats, differentiating between the diet groups into which the rats have been randomized.

library(ggplot2)
ggplot(RATS, aes(x=time, y=weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 2)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = 'none') +
  scale_y_continuous(limits = c(min(RATS$weight), max(RATS$weight)))

We notice that the initial weights of the rats in the 1st group are a lot smaller than those of the rats in the other groups. We might suggest that the diets of the 2nd and 3rd groups are better for weight gain.

Also, one rat in the second group seems to be an outlier. We will check later if it should be removed.
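To see which rat it is, we can list the mean weight of each individual; the suspect is the rat in the 2nd group with a clearly higher mean than the rest (a quick check using the RATS data loaded above):

```r
library(dplyr)
# mean weight per rat, largest first
RATS %>%
  group_by(Group, ID) %>%
  summarise(mean_weight = mean(weight)) %>%
  arrange(desc(mean_weight)) %>%
  head()
```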

Summary plot

Next, we will take a look at a plot of the group means together with the standard errors of the means.

library(tidyr)
library(dplyr)
n <- RATS$time %>% unique() %>% length()
RATS_means <- RATS %>%
  group_by(Group, time) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(n) ) %>%
  ungroup()

ggplot(RATS_means, aes(x = time, y = mean, linetype = as.factor(Group), shape = as.factor(Group))) +
  geom_line() +
  geom_point(size=2) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.5)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

Looking at this graph, I get the same impression as before. The rats on the 2nd and 3rd diets gained more weight than the rats on the 1st diet. But we have to note that the standard errors of the means are a lot smaller in the 1st group.

Outlier

We noticed one outlier. Let us check if it should be removed.

RATS_out <- RATS %>%
  filter(time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(weight) ) %>%
  ungroup()

ggplot(RATS_out, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), weeks 1-8")

We notice one outlier that differs significantly from the others. I think that we can remove it.

RATS_t <- RATS_out %>%
  filter(mean < 560) 
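With the outlier filtered out, the boxplots can be redrawn to confirm that the group summaries look cleaner (continuing with the RATS_t data created above):

```r
# same boxplot as before, but on the data without the outlier
ggplot(RATS_t, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  scale_y_continuous(name = "mean(weight), weeks 1-8, outlier removed")
```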

Anova

Now it is time to run an ANOVA test.

RATS_anova <- RATS_out %>%
  mutate(baseline = RATS_o$WD1) 
fit <- lm(mean ~ baseline + Group, data = RATS_anova)
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We notice that the baseline weight is strongly related to the weight values taken after the diets have started. There is no strong evidence of a diet difference after conditioning on the baseline value (p = 0.076). That means that we cannot say that there is a difference between these diets.

2. Analysis of BPRS data

In this part, we are going to use different linear mixed-effects models to analyze the BPRS data.

The data

We are going to download and explore our data now.

DATA <- read.csv("data/bprs.csv")

The data contains measurements of 40 male subjects who were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The scale is used to evaluate patients suspected of having schizophrenia.

We want to know if there is any difference between these treatments.

str(DATA)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
DATA$treatment <- factor(DATA$treatment)
DATA$subject <- factor(DATA$subject)
ggplot(DATA, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(DATA$bprs), max(DATA$bprs)))

Models

We are going to create a few models now. We will choose the best one.

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
model_1 <- lmer(bprs ~ week + treatment + (1|subject), data = DATA, REML = FALSE)
model_2 <- lmer(bprs ~ week + treatment + (week | subject), data = DATA, REML = FALSE)
model_3 <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = DATA, REML = FALSE)

anova(model_1, model_2, model_3)
## Data: DATA
## Models:
## model_1: bprs ~ week + treatment + (1 | subject)
## model_2: bprs ~ week + treatment + (week | subject)
## model_3: bprs ~ week + treatment + (week | subject) + week * treatment
##         Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## model_1  5 2748.7 2768.1 -1369.4   2738.7                           
## model_2  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## model_3  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood-ratio tests show that “model_2” fits significantly better than “model_1” (p = 0.026), while adding the interaction in “model_3” does not improve the fit significantly over “model_2” (p = 0.075). So we are going to use “model_2”.
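Since lmer does not report p-values for the fixed effects, approximate 95% confidence intervals give another view of the estimates (a quick check on model_2; the profiling may take a moment to run):

```r
# profile confidence intervals for the variance components and fixed effects
confint(model_2)
```

An interval for treatment2 that covers zero would be in line with the small t value seen in the model summary.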

summary(model_2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: DATA
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

The t value of “treatment2” is 0.550, which means that there is no evidence of a difference between the 1st and 2nd treatments.
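As a final check, we can plot the fitted values of the chosen model in the same style as the observed profiles; similar shapes would indicate that model_2 captures the individual trajectories reasonably well (a sketch continuing with DATA and model_2 from above):

```r
library(ggplot2)
# fitted bprs values from the chosen mixed-effects model
DATA$fitted <- fitted(model_2)
ggplot(DATA, aes(x = week, y = fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none")
```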